bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

, a sequence alignment is a way of arranging the sequences of DNA,

RNA Ribonucleic acid (RNA) is a polymeric molecule essential in various biological roles in coding, decoding, regulation and expression of genes. RNA and deoxyribonucleic acid ( DNA) are nucleic acids. Along with lipids, proteins, and carbohydra ...

, or protein to identify regions of similarity that may be a consequence of functional,

structural A structure is an arrangement and organization of interrelated elements in a material object or system, or the object or system so organized. Material structures include man-made objects such as buildings and machines and natural objects such a ...

, or

evolution Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...

ary relationships between the sequences. Aligned sequences of

nucleotide Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecules wi ...

amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha am ...

residues are typically represented as rows within a

matrix Matrix most commonly refers to: * ''The Matrix'' (franchise), an American media franchise ** ''The Matrix'', a 1999 science-fiction action film ** "The Matrix", a fictional setting, a virtual reality environment, within ''The Matrix'' (franchis ...

. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a

natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...

or in financial data. Histone Alignment

Interpretation

If two sequences in an alignment share a common ancestor, mismatches can be interpreted as

point mutation A point mutation is a genetic mutation where a single nucleotide base is changed, inserted or deleted from a DNA or RNA sequence of an organism's genome. Point mutations have a variety of effects on the downstream protein product—consequences ...

s and gaps as

indel Indel is a molecular biology term for an insertion or deletion of bases in the genome of an organism. It is classified among small genetic variations, measuring from 1 to 10 000 base pairs in length, including insertion and deletion events that ...

s (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between

s occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or

sequence motif In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as ''As ...

is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose

side chain In organic chemistry and biochemistry, a side chain is a chemical group that is attached to a core part of the molecule called the "main chain" or backbone. The side chain is a hydrocarbon branching element of a molecule that is attached to a l ...

s have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA

bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.

Alignment methods

Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: ''global alignments'' and ''local alignments''. Calculating a global alignment is a form of

global optimization Global optimization is a branch of applied mathematics and numerical analysis that attempts to find the global minima or maxima of a function or a set of functions on a given set. It is usually described as a minimization problem because the max ...

that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like

dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. I ...

. These also include efficient,

heuristic algorithm In mathematical optimization and computer science, heuristic (from Greek εὑρίσκω "I find, discover") is a technique designed for solving a problem more quickly when classic methods are too slow for finding an approximate solution, or whe ...

s or

probabilistic Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...

methods designed for large-scale database search, that do not guarantee to find best matches.

Representations

Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the

conservation Conservation is the preservation or efficient use of resources, or the conservation of various quantities under physical laws. Conservation may also refer to: Environment and natural resources * Nature conservation, the protection and manageme ...

of a given amino acid substitution. For multiple sequences the last row in each column is often the

consensus sequence In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It serves as a simplified r ...

determined by the alignment; the consensus sequence is also often represented in graphical format with a

sequence logo In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences). A sequence logo is created from a collection of aligned sequences and dep ...

in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation. Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as

FASTA format In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format a ...

and

GenBank The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part ...

format and the output is not easily editable. Several conversion programs that provide graphical and/or command line interfaces are available , such a
READSEQ
and

EMBOSS EMBOSS is a free open source software analysis package developed for the needs of the molecular biology and bioinformatics user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of ...

. There are also several programming packages which provide this conversion functionality, such as BioPython,

BioRuby BioRuby is a collection of open-source Ruby code, comprising classes for computational molecular biology and bioinformatics. It contains classes for DNA and protein sequence analysis, sequence alignment, biological database parsing, structural biol ...

and

BioPerl BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project. Background BioPerl is an active open source software project sup ...

. The SAM/BAM files use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions).

CIGAR Format

Ref. : GTCGTAGAATA

Read Read Read may refer to: * Reading, human cognitive process of decoding symbols in order to construct or derive meaning * Read (automobile), an American car manufactured from 1913 to 1915 * Read (biology), an inferred sequence of base pairs of ...

: CACGTAG—TA
CIGAR: 2S5M2D2M where:
2S = 2 soft clipping (could be mismatches, or a read longer than the matched sequence)
5M = 5 matches or mismatches
2D = 2 deletions
2M = 2 matches or mismatches The original CIGAR format from th
exonerate alignment program
did not distinguish between mismatches or matches with the M character. The SAMv1 spec document defines newer CIGAR codes. In most cases it is preferred to use the '=' and 'X' characters to denote matches or mismatches rather than the older 'M' character, which is ambiguous. * “Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively. * H can only be present as the first and/or last operation. * S may only have H operations between them and the ends of the CIGAR string. * For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined. * Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ

Global and local alignments

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the

Needleman–Wunsch algorithm The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...

, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The

Smith–Waterman algorithm The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorit ...

is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place. Hybrid methods, known as semi-global or "glocal" (short for global-local) methods, search for the best possible partial alignment of the two sequences (in other words, a combination of one or both starts and one or both ends is stated to be aligned). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap. Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence. Fast expansion of genetic data challenges speed of current DNA sequence alignment algorithms. Essential needs for an efficient and accurate method for DNA variant discovery demand innovative approaches for parallel processing in real time.

Optical computing Optical computing or photonic computing uses light waves produced by lasers or incoherent sources for data processing, data storage or data communication for computing. For decades, photons have shown promise to enable a higher bandwidth than the ...

approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be teste

Pairwise alignment

Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods; however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low

information content In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative wa ...

- especially where the number of repetitions differ in the two sequences to be aligned.

Maximal unique match

One way of quantifying the utility of a given pairwise alignment is the ' maximal unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness. in the

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

genomes In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding gen ...

computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...

. Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as

MUMmer Mummers' plays are folk plays performed by troupes of amateur actors, traditionally all male, known as mummers or guisers (also by local names such as ''rhymers'', ''pace-eggers'', ''soulers'', ''tipteerers'', ''wrenboys'', and ''galoshins''). ...

. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we can break down each word in the acronym. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment. More precisely:

"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d= 20) such that * it is maximal, that is, it cannot be extended on either end without incurring a mismatch; and * it is unique in both sequences"

Dot-matrix methods

The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features—such as insertions, deletions, repeats, or

inverted repeat An inverted repeat (or IR) is a single stranded sequence of nucleotides followed downstream by its reverse complement. The intervening sequence of nucleotides between the initial sequence and the reverse complement can be any length including zero. ...

s—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional

and a dot is placed at any point where the characters in the appropriate columns match—this is a typical

recurrence plot In descriptive statistics and chaos theory, a recurrence plot (RP) is a plot showing, for each moment i in time, the times at which the state of a dynamical system returns to the previous state at i, i.e., when the phase space trajectory visits rou ...

. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's

main diagonal In linear algebra, the main diagonal (sometimes principal diagonal, primary diagonal, leading diagonal, major diagonal, or good diagonal) of a matrix A is the list of entries a_ where i = j. All off-diagonal elements are zero in a diagonal matrix. ...

. Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws. Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect occurs when a protein consists of multiple similar

structural domain In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of s ...

Dynamic programming

The technique of

can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a

substitution matrix In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in t ...

to assign scores to amino-acid matches or mismatches, and a

gap penalty A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an a ...

for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.) A common extension to standard linear gap costs, is the usage of two different gap penalties for opening a gap and for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. Thus, the number of gaps in an alignment is usually reduced and residues and gaps are kept together, which typically makes more biological sense. The Gotoh algorithm implements affine gap costs by using three matrices. Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account

frameshift Ribosomal frameshifting, also known as translational frameshifting or translational recoding, is a biological phenomenon that occurs during translation that results in the production of multiple, unique proteins from a single mRNA. The process can ...

mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The

BLAST Blast or The Blast may refer to: * Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film) ...

and

suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from

open-source software Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose. Op ...

such a
GeneWise
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.

Word methods

Word methods, also known as ''k''-tuple methods, are

heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...

methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools

FASTA FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics. History The original FASTA program ...

and the

family. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated. In the FASTA method, the user defines a value ''k'' to use as the word length with which to search the database. The method is slower but more sensitive at lower values of ''k'', which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length ''k'', but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such a
EMBL FASTA
an
NCBI BLAST

Multiple sequence alignment

Multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and

mechanistic The mechanical philosophy is a form of natural philosophy which compares the universe to a large-scale mechanism (i.e. a machine). The mechanical philosophy is associated with the scientific revolution of early modern Europe. One of the first expos ...

information to locate the catalytic

active site In biology and biochemistry, the active site is the region of an enzyme where substrate molecules bind and undergo a chemical reaction. The active site consists of amino acid residues that form temporary bonds with the substrate (binding site) a ...

s of

enzyme Enzymes () are proteins that act as biological catalysts by accelerating chemical reactions. The molecules upon which enzymes may act are called substrates, and the enzyme converts the substrates into different molecules known as products. A ...

s. Alignments are also used to aid in establishing evolutionary relationships by constructing

phylogenetic tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological spec ...

s. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to

NP-complete In computational complexity theory, a problem is NP-complete when: # it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by tryi ...

combinatorial optimization problems. Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

Dynamic programming

The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and

memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered, ...

, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs"

objective function In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...

, has been implemented in th
MSA
software package.

Progressive methods

Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to

. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy. Many variations of the

Clustal Clustal is a series of widely used computer programs used in bioinformatics for multiple sequence alignment. There have been many versions of Clustal over the development of the algorithm that are listed below. The analysis of each tool and its a ...

progressive implementation are used for multiple sequence alignment, phylogenetic tree construction, and as input for

protein structure prediction Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different ...

. A slower but more accurate variant of the progressive method is known as

T-Coffee T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation) is a multiple sequence alignment software using a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. It can al ...

Iterative methods

Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an

based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function are reviewed in.

Motif finding

Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved

s among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly conserved regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original

data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the ...

contained a small number of sequences, or only highly related sequences,

pseudocount In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data. Given a set of observation counts \textstyle from a \textstyle -dimensional multinomial distribution with \ ...

s are added to normalize the character distributions represented in the motif.

Techniques inspired by computer science

A profile HMM modelling a multiple sequence alignment

A variety of general

optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...

algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem.

Hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

s have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.

Genetic algorithm In computer science and operations research, a genetic algorithm (GA) is a metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used to gene ...

s and

simulated annealing Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It ...

have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article

. The

Burrows–Wheeler transform The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated c ...

has been successfully applied to fast short read alignment in popular tools such as

Bowtie The bow tie is a type of necktie. A modern bow tie is tied using a common shoelace knot, which is also called the bow knot for that reason. It consists of a ribbon of fabric tied around the collar of a shirt in a symmetrical manner so that the ...

and BWA. See

FM-index In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manz ...

Structural alignment

Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the

secondary Secondary may refer to: Science and nature * Secondary emission, of particles ** Secondary electrons, electrons generated as ionization products * The secondary winding, or the electrical or electronic circuit connected to the secondary winding i ...

and

tertiary structure Protein tertiary structure is the three dimensional shape of a protein. The tertiary structure will have a single polypeptide chain "backbone" with one or more protein secondary structures, the protein domains. Amino acid side chains may int ...

of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through

X-ray crystallography X-ray crystallography is the experimental science determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions. By measuring the angles ...

NMR spectroscopy Nuclear magnetic resonance spectroscopy, most commonly known as NMR spectroscopy or magnetic resonance spectroscopy (MRS), is a spectroscopic technique to observe local magnetic fields around atomic nuclei. The sample is placed in a magnetic fiel ...

). Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity. Structural alignments are used as the "gold standard" in evaluating alignments for homology-based

because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.

DALI

The DALI method, or

distance matrix In mathematics, computer science and especially graph theory, a distance matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the ''dist ...

alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the

Protein Data Bank The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cry ...

(PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed a
DALI
and the FSSP is located a
The Dali Database

SSAP

SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments, and has been used in the construction of the

CATH The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and coll ...

(Class, Architecture, Topology, Homology) hierarchical database classification of protein folds. The CATH database can be accessed a
CATH Protein Structure Classification

Combinatorial extension

The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment. Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor

hydrophobic In chemistry, hydrophobicity is the physical property of a molecule that is seemingly repelled from a mass of water (known as a hydrophobe). In contrast, hydrophiles are attracted to water. Hydrophobic molecules tend to be nonpolar and, th ...

ity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at th
Combinatorial Extension
website.

Phylogenetic analysis

Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness. The field of

phylogenetics In biology, phylogenetics (; from Greek language, Greek wikt:φυλή, φυλή/wikt:φῦλον, φῦλον [] "tribe, clan, race", and wikt:γενετικός, γενετικός [] "origin, source, birth") is the study of the evolutionary his ...

makes extensive use of sequence alignments in the construction and interpretation of

s, which are used to classify the evolutionary relationships between homologous

gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...

s represented in the

genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...

s of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young

most recent common ancestor In biology and genetic genealogy, the most recent common ancestor (MRCA), also known as the last common ancestor (LCA) or concestor, of a set of organisms is the most recent individual from which all the organisms of the set are descended. The ...

, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "

molecular clock The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleoti ...

" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the

coalescence Coalescence may refer to: * Coalescence (chemistry), the process by which two or more separate masses of miscible substances seem to "pull" each other together should they make the slightest contact * Coalescence (computer science), the merging of ...

time), assumes that the effects of mutation and

selection Selection may refer to: Science * Selection (biology), also called natural selection, selection in evolution ** Sex selection, in genetics ** Mate selection, in mating ** Sexual selection in humans, in human sexuality ** Human mating strategie ...

are constant across sequence lineages. Therefore, it does not account for possible difference among organisms or species in the rates of

DNA repair DNA repair is a collection of processes by which a cell identifies and corrects damage to the DNA molecules that encode its genome. In human cells, both normal metabolic activities and environmental factors such as radiation can cause DNA dam ...

or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between

silent mutation Silent mutations are mutations in DNA that do not have an observable effect on the organism's phenotype. They are a specific type of neutral mutation. The phrase ''silent mutation'' is often used interchangeably with the phrase '' synonymous mutat ...

s that do not alter the meaning of a given

codon The genetic code is the set of rules used by living cells to translate information encoded within genetic material ( DNA or RNA sequences of nucleotide triplets, or codons) into proteins. Translation is accomplished by the ribosome, which links ...

and other mutations that result in a different

being incorporated into the protein). More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes. Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly

because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is

NP-hard In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...

Assessment of significance

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that

convergent evolution Convergent evolution is the independent evolution of similar features in species of different periods or epochs in time. Convergent evolution creates analogous structures that have similar form or function but were not present in the last com ...

can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures. In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts. Methods of statistical significance estimation for gapped sequence alignments are available in the literature.

Assessment of credibility

Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.

Scoring functions

The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by

Margaret Dayhoff Margaret Belle (Oakley) Dayhoff (March 11, 1925 – February 5, 1983) was an American physical chemist and a pioneer in the field of bioinformatics. Dayhoff was a professor at Georgetown University Medical Center and a noted research biochem ...

and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as

BLOSUM In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based o ...

(Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function. It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.

Other biological uses

Sequenced RNA, such as

expressed sequence tags In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and were instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proc ...

and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about

alternative splicing Alternative splicing, or alternative RNA splicing, or differential splicing, is an alternative splicing process during gene expression that allows a single gene to code for multiple proteins. In this process, particular exons of a gene may be ...

and

RNA editing RNA editing (also RNA modification) is a molecular process through which some cells can make discrete changes to specific nucleotide sequences within an RNA molecule after it has been generated by RNA polymerase. It occurs in all living organisms ...

. Sequence alignment is also a part of

genome assembly In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in on ...

, where sequences are aligned to find overlap so that ''

contig A contig (from ''contiguous'') is a set of overlapping DNA segments that together represent a consensus region of DNA.Gregory, S. ''Contig Assembly''. Encyclopedia of Life Sciences, 2005. In bottom-up sequencing projects, a contig refers to ov ...

s'' (long stretches of sequence) can be formed. Another use is SNP analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.

Non-biological uses

The methods used for biological sequence alignment have also found applications in other fields, most notably in

natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...

and in social sciences, where the Needleman-Wunsch algorithm is usually referred to as

Optimal matching Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such dista ...

. Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs. In the field of historical and comparative

linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...

, sequence alignment has been used to partially automate the

comparative method In linguistics, the comparative method is a technique for studying the development of languages by performing a feature-by-feature comparison of two or more languages with common descent from a shared ancestor and then extrapolating backwards t ...

by which linguists traditionally reconstruct languages. Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time. See also Prinzie and Van den Poel's paper

Software

A more complete list of available software categorized by algorithm and alignment type is available at

sequence alignment software This list of sequence alignment software is a compilation of software tools and web portals used in pairwise sequence alignment and multiple sequence alignment. See structural alignment software for structural alignment of proteins. Database sear ...

, but common software tools used for general sequence alignment tasks include ClustalW2 and T-coffee for alignment, and BLAST and FASTA3x for database searching. Commercial tools such as DNASTAR Lasergene, Geneious, and

PatternHunter PatternHunter is a commercially available homology search instrument software that uses sequence alignment techniques. It was initially developed in the year 2002 by three scientists: Bin Ma, John Tramp and Ming Li. These scientists were driven by ...

are also available. Tools annotated as performin
sequence alignment
are listed in th
bio.tools
registry. Alignment algorithms and software can be directly compared to one another using a standardized set of

benchmark Benchmark may refer to: Business and economics * Benchmarking, evaluating performance within organizations * Benchmark price * Benchmark (crude oil), oil-specific practices Science and technology * Benchmark (surveying), a point of known elevatio ...

reference multiple sequence alignments known as BAliBASE. The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE. A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.

References

External links

* {{Authority control Bioinformatics algorithms Computational phylogenetics Evolutionary developmental biology Algorithms on strings